
feat: implement Project.save() and Dataset.save() #503

Open
thiagobomfin-galileo wants to merge 7 commits into main from feat/sc-52119-project-dataset-save

Conversation


@thiagobomfin-galileo thiagobomfin-galileo commented Mar 13, 2026

User description

Shortcut:
Implement Missing CRUD Operations in future Package

Description:

  • Adds update() to the Projects and Datasets service classes, exposing the existing but unused
    PUT /projects/{id} and PATCH /datasets/{id} API endpoints
  • Implements Project.save() and Dataset.save() in the future domain layer, replacing
    NotImplementedError stubs with full state-dispatching logic

Tests:

  • Unit Tests Added
  • E2E Test Added (if it's a user-facing feature, or fixing a bug)

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Implement state-aware Dataset.save() and Project.save() so dirty entities dispatch to the new update helpers, transition their sync state, and surface API failures through logging and tests. Expose Datasets.update() and Projects.update() to call the PATCH/PUT endpoints with validation before returning refreshed models, satisfying the missing CRUD functionality in the future domain.

Topic: Entity Save Flow
Details: Update Dataset.save() and Project.save() to call their backing services when dirty, keep sync-state transitions accurate, and add tests covering creation, no-ops, dirty updates, and API failures.
Modified files (4)
  • src/galileo/__future__/dataset.py
  • src/galileo/__future__/project.py
  • tests/future/test_dataset.py
  • tests/future/test_project.py
Latest contributors (1)
  • vamaq@users.noreply.gi... (commit fix-dataset-204-respon..., February 07, 2026)

Topic: Service Update APIs
Details: Expose Datasets.update() and Projects.update() to use the existing PATCH/PUT APIs with logging and response validation before yielding refreshed models.
Modified files (2)
  • src/galileo/datasets.py
  • src/galileo/projects.py
Latest contributors (2)
  • bipin@galileo.ai (commit fix-normalize-ground_t..., February 18, 2026)
  • vamaq@users.noreply.gi... (commit fix-dataset-204-respon..., February 07, 2026)

@thiagobomfin-galileo thiagobomfin-galileo self-assigned this Mar 13, 2026
@thiagobomfin-galileo thiagobomfin-galileo requested a review from a team as a code owner March 13, 2026 16:06

codecov bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.14%. Comparing base (66e86cd) to head (897be4f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/galileo/datasets.py 11.76% 15 Missing ⚠️
src/galileo/projects.py 21.05% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #503      +/-   ##
==========================================
+ Coverage   82.13%   82.14%   +0.01%     
==========================================
  Files          96       99       +3     
  Lines        9282     9467     +185     
==========================================
+ Hits         7624     7777     +153     
- Misses       1658     1690      +32     


Comment on lines +303 to +314
def update(self, project_id: str, *, name: Optional[str] = None) -> ProjectUpdateResponse:
    """
    Updates a project's properties.

    Parameters
    ----------
    project_id : str
        The ID of the project to update.
    name : str, optional
        The new name for the project.

    Returns
Projects.update() doesn't emit the required start/completion lifecycle logs (AGENTS.md requires write methods to log) and it doesn't log failure paths: ProjectsAPIException for non-200 responses and ValueError for empty responses are raised with no telemetry. These write failures are therefore silent in telemetry. Can we emit start/completion logs and log each failure path before raising?

Finding type: AI Coding Guidelines | Severity: 🟢 Low



Prompt for AI Agents:

In src/galileo/projects.py around lines 303-343, the update method is missing lifecycle
logging and failure telemetry. Add an initial start log (e.g. _logger.info with
action='project.update', phase='start', project_id=project_id, name=name), and add a
completion log on success (phase='complete', project_id, name). Before raising
ProjectsAPIException on non-200 responses, log an error with response.status_code and
content; before raising for HTTPValidationError, log the validation error details;
before raising ValueError for an empty response, log an error indicating no response was
returned. Use the existing _logger and keep logs concise and structured so every failure
path is logged prior to raising.

Commit 609bcb7 addressed this comment by adding lifecycle logs around Projects.update(), logging a start event, error events for non-200/validation/empty responses, and a completion event on success, ensuring every failure path records telemetry before raising.
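The lifecycle-logging pattern described above can be sketched as follows. This is a minimal illustration, not the code from commit 609bcb7: the logger name, log-field layout, and the `RuntimeError` standing in for `ProjectsAPIException` are all assumptions.

```python
import logging

_logger = logging.getLogger("galileo.projects")


def update_with_lifecycle_logs(project_id: str, name: str, status_code: int = 200) -> dict:
    # Start log fires before any work, so even failed calls leave a trail.
    _logger.info("action='project.update' phase='start' project_id=%s name=%s", project_id, name)
    if status_code != 200:
        # Log failure context (non-sensitive) immediately before raising,
        # so every failure path records telemetry.
        _logger.error(
            "action='project.update' phase='error' project_id=%s status=%s", project_id, status_code
        )
        raise RuntimeError(f"project update failed: HTTP {status_code}")
    _logger.info("action='project.update' phase='complete' project_id=%s", project_id)
    return {"id": project_id, "name": name}
```

The `status_code` parameter here simulates the API response so the error path is exercisable; the real method would read it from the detailed response object.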

Comment on lines +425 to +436
def update(self, dataset_id: str, *, name: Optional[str] = None) -> Dataset:
    """
    Updates a dataset's properties.

    Parameters
    ----------
    dataset_id : str
        The ID of the dataset to update.
    name : str, optional
        The new name for the dataset.

    Returns

Datasets.update() performs a write but emits no lifecycle logs; AGENTS.md requires start/completion and failure logs with non-sensitive context. It raises DatasetAPIException/ValueError without logging failure details, so there's no audit trail or failure context for debugging. Can we add structured start/completion logs and log failures with non-sensitive context before raising to comply with AGENTS.md?

Finding type: AI Coding Guidelines | Severity: 🟠 Medium



Prompt for AI Agents:

In src/galileo/datasets.py around lines 425 to 460, the update method is missing
lifecycle logging. Add structured logs: log an INFO at the start of the update with
non-sensitive context {"dataset_id": dataset_id, "name": name}, log an INFO on
successful update with the same context plus the updated dataset id, and log an ERROR
before raising when the call returns HTTPValidationError, when response is empty
(ValueError), or on any other exception. Use the service logger available on the client
(e.g. self.config.logger) and include exception details in the error logs without
including sensitive payloads; ensure logs occur immediately before raising the
corresponding exceptions.


Commit 609bcb7 addressed this comment by adding structured lifecycle logging to Datasets.update() with INFO logs at start and success and ERROR logs before raising for validation errors, empty responses, and exceptions.

@pratduv pratduv removed their request for review March 13, 2026 21:22
@thiagobomfin-galileo thiagobomfin-galileo force-pushed the feat/sc-52119-project-dataset-save branch from ef52dcd to 609bcb7 on March 17, 2026 18:02
Comment on lines +550 to +554
if self.sync_state == SyncState.LOCAL_ONLY:
    return self.create()
if self.sync_state == SyncState.SYNCED:
    logger.debug(f"Dataset.save: id='{self.id}' - already synced, no action needed")
    return self

Dataset.save now re-implements the entire state-transition + persistence workflow (LOCAL_ONLY -> create, SYNCED no-op, DELETED guard) before calling Datasets.update, which is essentially the same block that Project.save copies verbatim (see project.py lines 785‑809). Any change to the state checks, logging, or _set_state error handling now needs to be made in two places. Can we consolidate this into StateManagementMixin (e.g. a shared _persist_changes(create_fn, update_fn, attr_sync) helper) so dataset/project save reuse one canonical workflow and only supply the resource-specific update call?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low




Valid observation. The save methods are nearly identical 30-line blocks. However, this is a low-severity concern and the duplication is manageable for two classes. Extracting a shared helper now would be premature if more fields diverge between the two (which is likely given different API shapes).

  • Recommended action: Defer to a follow-up. If a third domain object (Prompt) also gets the same pattern, consolidate then.


Noted, I’ll defer the refactor as suggested and will save this preference to memory after the PR is merged.
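If the deferred consolidation is eventually done, the suggested `_persist_changes` helper could take a shape like the following. This is a sketch only: the `SyncState` values match the review discussion, but the mixin layout, method names, and error message are assumptions about a future refactor.

```python
from enum import Enum
from typing import Callable


class SyncState(Enum):
    LOCAL_ONLY = "local_only"
    SYNCED = "synced"
    DIRTY = "dirty"
    DELETED = "deleted"


class StateManagementMixin:
    sync_state: SyncState = SyncState.LOCAL_ONLY

    def _set_state(self, state: SyncState) -> None:
        self.sync_state = state

    def _persist_changes(
        self,
        create_fn: Callable[[], "StateManagementMixin"],
        update_fn: Callable[[], None],
    ) -> "StateManagementMixin":
        # One canonical save workflow: Dataset/Project supply only the
        # resource-specific create and update calls.
        if self.sync_state == SyncState.DELETED:
            raise ValueError("cannot save a deleted entity")
        if self.sync_state == SyncState.LOCAL_ONLY:
            return create_fn()
        if self.sync_state == SyncState.SYNCED:
            return self  # nothing to persist
        update_fn()  # resource-specific PATCH/PUT call
        self._set_state(SyncState.SYNCED)
        return self
```

Both save() methods would then reduce to a one-line call passing their own create/update closures.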

Comment on lines +563 to +571
logger.info(f"Dataset.save: name='{self.name}' id='{self.id}' - started")
datasets_service = Datasets()
updated_dataset = datasets_service.update(self.id, name=self.name)

# Update attributes from response
self.name = updated_dataset.name
self.updated_at = updated_dataset.updated_at

self._set_state(SyncState.SYNCED)

Dataset.save() claims to persist local modifications but only passes name to Datasets.update — should we include mutated fields such as draft in the update before marking SYNCED or restrict save() to renames and update its docstring?

Finding type: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

In src/galileo/__future__/dataset.py around lines 563 to 571, the save() method
currently calls Datasets.update(self.id, name=self.name) and then marks the instance
SYNCED, but it ignores other mutable fields (e.g. draft, column_mapping) so local edits
are not sent to the API. Change save() to build an update payload from all mutable
attributes that should be persisted (at minimum include draft and column_mapping, plus
any other updatable fields the class exposes), pass that payload into
datasets_service.update(self.id, **payload), and then apply the returned updated_dataset
attributes (name, updated_at, draft, column_mapping, etc.) before setting state to
SYNCED. Also update the docstring to accurately describe which fields are persisted. If
adding fields is not desired, instead narrow the docstring and behavior to state that
save() only persists name changes and raise/handle accordingly.


  • Analysis: Valid. The UpdateDatasetRequest model supports name, column_mapping, and draft. But save() only sends name, and only syncs back name and updated_at — silently dropping changes to draft or column_mapping. The docstring claims to persist "any local modifications" which is inaccurate.
  • Recommended action: Either (a) pass all mutable fields through the update call, or (b) narrow the docstring to say "persists name changes" and document the limitation. Option (a) is better for user expectations.


Good catch on the mismatch between the docstring and the actual fields being updated—either expanding the update payload or narrowing the docstring makes sense. Noted, I’ll save this to memory after the PR is merged.

Comment on lines +448 to +452
logger.info("action='dataset.update' phase='start' dataset_id=%s name=%s", dataset_id, name)

body = UpdateDatasetRequest(name=name)

try:

Datasets.update sends name: null when name is None, violating the API's non-null name contract — should we only set name when provided (e.g. UpdateDatasetRequest(name=name if name is not None else UNSET))?

Finding type: Type Inconsistency | Severity: 🔴 High



Prompt for AI Agents:

In src/galileo/datasets.py around lines 448 to 452, the update method constructs
UpdateDatasetRequest(name=name) even when name is None which will serialize to name:
null and may clear a non-nullable dataset name. Change this so the request omits the
name field when the caller did not provide it: either pass name=UNSET when name is None
(using UNSET from galileo.resources.types) or build the request kwargs only with 'name'
if name is not None, then instantiate UpdateDatasetRequest(**kwargs). Ensure imports are
adjusted if needed and keep the existing logging/error handling intact.


  • Analysis: The auto-generated UpdateDatasetRequest model defines name: Union["Name", None, Unset, str] = UNSET. Passing name=None sets it to None (not UNSET), which would serialize to null. However, in the actual call path from Dataset.save(), name is always self.name which is a string (never None) since a dataset must have a name. The service method's Optional[str] signature technically allows None, but the real caller never passes it. This is a minor API design issue (the Optional parameter type is slightly misleading), not a bug in practice.
  • Recommended action: Optionally tighten the signature or add a guard, but this is not a blocking issue since the only caller always passes a string.


Got it, thanks for clarifying the real call path and impact—this is noted as a minor API design concern rather than a functional bug, and I’ll remember this refinement after the PR is merged.
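The UNSET-versus-None distinction at the heart of this thread can be demonstrated with a standalone sentinel; this mirrors how openapi-python-client's `UNSET` behaves, but the classes here are a self-contained sketch, not the generated galileo code.

```python
class _Unset:
    """Sentinel distinguishing 'not provided' from an explicit None."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


def build_request_body(name=UNSET) -> dict:
    # Only include keys the caller explicitly provided, so an omitted
    # field never serializes to null and clears a server-side value.
    body = {}
    if name is not UNSET:
        body["name"] = name
    return body
```

With a plain `Optional[str]` default of `None`, the "omitted" and "explicitly null" cases are indistinguishable, which is exactly the hazard the finding describes.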

Comment on lines +797 to +805
try:
    logger.info(f"Project.save: name='{self.name}' id='{self.id}' - started")
    projects_service = Projects()
    updated_project = projects_service.update(self.id, name=self.name)

    # Update attributes from response
    if not isinstance(updated_project.name, Unset):
        self.name = updated_project.name
    self.updated_at = updated_project.updated_at

Project.save() only persists name but marks the project SYNCED, should we include other editable fields (at least type) in the update or prevent saving when only non-name fields changed?

Finding type: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

In src/galileo/__future__/project.py around lines 797 to 805, the save() method
currently only sends name to Projects.update and only applies updated_project.name back
to self, so changes to other editable fields (e.g., type, labels, description) are never
persisted. Refactor save() to build an update payload that includes all editable fields
(name, type, labels, description, etc.) by checking each attribute against Unset and
including it if set; pass that payload into projects_service.update(self.id, ...). After
the API call, update each corresponding local attribute from the returned
updated_project (only when the returned value is not Unset) before setting the SyncState
to SYNCED so the local state accurately reflects the persisted remote state.


Project.save() only persists name but ProjectUpdate supports name, type_, labels, description.

  • Analysis: Same issue as Dataset. The docstring claims "persists any local modifications" but only name is sent.


Good catch on the mismatch between the docstring and what Project.save() actually persists; I’ll remember to treat this as a broader-field update issue (name, type_, labels, description) after the PR is merged.

Comment on lines +448 to +452
logger.info("Datasets.update: dataset_id=%s name=%s - started", dataset_id, name)

body = UpdateDatasetRequest(name=name)

detailed_response = update_dataset_datasets_dataset_id_patch.sync_detailed(

Datasets.update copies the same control flow as Projects.update (see src/galileo/projects.py lines 329‑354): both log start, build a body, call .sync_detailed, check for non‑200, HTTPValidationError, empty responses, and log completion. Any future change to error handling, logging, or validation must be manually synced in both places, which is a maintenance risk. Can we consolidate the shared update/response handling into a helper (e.g., a utility that executes the API call, checks status_code, unwraps validation errors, and returns the parsed model) and then only parameterize it with the request body, API call, and exception type for datasets vs. projects?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low




  • Analysis: Valid but low severity. The two methods share the same structure but differ in endpoint, request model, exception type, and response model. A shared helper would need to be generic over all of these, adding complexity for marginal benefit.
  • Recommended action: Defer. Not worth the abstraction for two call sites.


Got it, that makes sense—I'll treat this as a low-priority refactor and remember to defer this abstraction after the PR is merged.

dataset properties.
Persists any local modifications to the API. If the dataset has not been
created yet (LOCAL_ONLY state), this calls create() instead. If already
synced with no pending changes, this is a no-op.

  • File: src/galileo/__future__/dataset.py (lines 529-531), src/galileo/__future__/project.py (lines 763-765)
  • The UpdateDatasetRequest supports name, column_mapping, draft. The ProjectUpdate supports name, type_, labels, description. Users who change dataset.draft = True and call save() will believe it's persisted when it isn't.
  • Suggestion: Pass all updatable fields through the service calls and sync them back from the response.


# Update attributes from response
self.name = updated_dataset.name
self.updated_at = updated_dataset.updated_at

  • [Major]: Dataset.save() only syncs back name and updated_at from the API response, unlike create() and refresh() which sync all fields (num_rows, column_names, draft, etc.)
    • File: src/galileo/__future__/dataset.py (lines 570-572)
    • After a successful update, the local object could have stale values for fields returned by the API. The create() method (line 175-182) and refresh() method (lines 509-515) both update all attributes. save() should do the same for consistency.
    • Suggestion: After a successful update, sync all attributes from the response (or call self.refresh() to get the full object state).


return delete_dataset_datasets_dataset_id_delete.sync(client=self.config.api_client, dataset_id=dataset.id)

def update(self, dataset_id: str, *, name: Optional[str] = None) -> Dataset:

This ticket is about implementing methods in the __future__ package. This is a legacy function from the module that will be deprecated. Let's not add new functions there.


return Project(project=response)

def update(self, project_id: str, *, name: Optional[str] = None) -> Project:

This ticket is about implementing methods in the __future__ package. This is a legacy function from the module that will be deprecated. Let's not add new functions there.

def test_save_without_id_raises_value_error(self, reset_configuration: None) -> None:
    # Given: a dataset in DIRTY state with no ID
    dataset = Dataset(name="Test Dataset")
    dataset._set_state(SyncState.DIRTY)

This should work as documented without having to manually set DIRTY:

`dataset.name = "renamed"; dataset.save()`

This is a pre-existing design gap (no property setters that auto-mark DIRTY), but the new save() method makes it user-facing.

Suggestion: Add property setters that auto-transition to DIRTY
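The suggested auto-DIRTY setters could be sketched like this. The two-state enum and the class layout are simplified assumptions; only the idea, that assigning a tracked attribute transitions the entity to DIRTY, is taken from the review.

```python
from enum import Enum


class SyncState(Enum):
    SYNCED = "synced"
    DIRTY = "dirty"


class Dataset:
    def __init__(self, name: str) -> None:
        self._name = name
        self.sync_state = SyncState.SYNCED

    @property
    def name(self) -> str:
        return self._name

    @name.setter
    def name(self, value: str) -> None:
        # Mutating a tracked attribute automatically marks the entity DIRTY,
        # so `dataset.name = "renamed"; dataset.save()` works as documented
        # without callers touching _set_state.
        if value != self._name:
            self._name = value
            self.sync_state = SyncState.DIRTY
```

An unchanged assignment (setting the same value) intentionally stays SYNCED, avoiding spurious PATCH calls on save().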


response = detailed_response.parsed

if isinstance(response, HTTPValidationError):

  • [Minor]: HTTPValidationError check after status_code != 200 is unreachable in Datasets.update()
    • File: src/galileo/datasets.py (lines 462-464)
    • A 422 validation error would already be caught by the status_code != 200 check on line 456. The HTTPValidationError isinstance check on line 462 would only fire if the server returned HTTP 200 with a validation error body, which is not a realistic scenario. Same applies to Projects.update() (line 344).
    • Suggestion: This is a defensive pattern used elsewhere in the codebase, so it's consistent. No change needed, but worth noting.
